XGBoost | GridSearchCV
This page documents the README for the machine learning model I deployed for predicting student scores. Unfortunately, I do not have permission to disclose the dataset. The instructions on how to use the functions built around the model (data engineering, encoding and imputation based on YAML file configurations, hyperparameter tuning, model validation, model prediction) are detailed here, along with the set-up and overall architecture of the software. The code can be found on my GitHub.
📦AIAP_ASSESSMENT
 ┣ 📂data
 ┃ ┣ 📜score.csv
 ┃ ┣ 📜score.db
 ┃ ┗ 📜score_processed.csv
 ┣ 📂logs
 ┃ ┣ 📜gscv_test.yml
 ┃ ┗ 📜gscv_test_log.yml
 ┣ 📂src
 ┃ ┣ 📜config.yml
 ┃ ┣ 📜grid_search.py
 ┃ ┣ 📜main.py
 ┃ ┣ 📜make_prediction.py
 ┃ ┗ 📜make_validation.py
 ┣ 📜eda.ipynb
 ┣ 📜README.md.txt
 ┣ 📜requirements.txt
 ┗ 📜run.sh
Before attempting to execute the pipeline, complete the following set-up procedures for your terminal of choice:
Set-up for Git Bash
Note: If Anaconda is being used to run Python and the 'python' command cannot be found by Git Bash, run this command to append Anaconda's python.exe directory to the PATH: 'PATH=$PATH:/D/ANACONDA/'
Set-up for Linux
Once the set-up is complete, open the 'config.yml' file found in the 'src' folder. This configuration file contains all configurations for the machine learning pipeline and its functions, including the parameters used by the models and the data processing. The full breakdown of the file is as seen below:
xgbr is an abbreviation for XGBRegressor() and 'xgbr_parameters' is a dictionary which contains the parameters for configuring an instance of the model. The parameters are accessed when you set 'use_configured' to 'y' when running 'python make_validation.py'.
xgbr_parameters:
objective: "reg:squarederror"
learning_rate: 0.2
min_split_loss: 0.01
max_depth: 7
min_child_weight: 1.0
subsample: 1
reg_lambda: 1.0
reg_alpha: 0
scale_pos_weight: 1.0
max_bin: 256
booster: "gblinear"
max_delta_step: 0
verbosity: 2
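As an illustration, here is a minimal sketch (not the actual pipeline code) of how a block such as 'xgbr_parameters' could be loaded from 'config.yml' and unpacked into a model; the same pattern applies to 'xgbc_parameters' and XGBClassifier().

```python
# Minimal sketch, assuming PyYAML and xgboost are installed and the
# script is run from the repository root; not the exact pipeline code.
import yaml
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    config = yaml.safe_load(f)

# Unpack the dictionary straight into the model constructor.
model = XGBRegressor(**config["xgbr_parameters"])
print(model.get_params()["learning_rate"])  # 0.2, per the config above
```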
xgbc is an abbreviation for XGBClassifier() and 'xgbc_parameters' is a dictionary which contains the parameters for configuring an instance of the model. The parameters are accessed when you set 'use_configured' to 'y' when running 'python make_validation.py'.
xgbc_parameters:
objective: "multi:softprob"
learning_rate: 0.2
min_split_loss: 0.01
max_depth: 7
min_child_weight: 1.0
subsample: 1
reg_lambda: 1.0
reg_alpha: 0
scale_pos_weight: 1.0
max_bin: 256
booster: "gbtree"
max_delta_step: 0
verbosity: 2
feature_parameters is a setting for the data processing in the pipeline: set 'include' to True (boolean) to include engineered features, or to False (boolean) to exclude them.
feature_parameters:
include: True
impute_parameters is a setting for the data processing in the pipeline: set 'impute_type' to 'iter' to use iterative imputation for NaN data, or to 'simp' to use simple (mean) imputation for NaN data.
impute_parameters:
impute_type: "simp"
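For illustration, a minimal sketch of how 'impute_type' could switch between the two imputers (the actual pipeline implementation may differ):

```python
# Minimal sketch, assuming scikit-learn; IterativeImputer is still
# experimental and requires the explicit enable import.
import yaml
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

with open("src/config.yml", "r") as f:
    impute_type = yaml.safe_load(f)["impute_parameters"]["impute_type"]

if impute_type == "iter":
    imputer = IterativeImputer(random_state=0)  # iterative imputation
else:                                           # "simp"
    imputer = SimpleImputer(strategy="mean")    # simple mean imputation
# X_imputed = imputer.fit_transform(X)          # X: feature matrix placeholder
```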
grid_search_parameters is a dictionary containing grid search CV (cross validation) parameters for different models. Currently, the included models are XGBRegressor and XGBClassifier. Each model has a dictionary within the grid_search_parameters dictionary which will be accessed when you select that specific model to run grid search CV on. Specify the different values in the lists below; grid search forms the cross product of all parameter values and then tests each combination. The parameters are accessed when you run 'python grid_search.py'.
Note: Grid search cross validation can take a long time; set verbose (optional argument) to 3 to receive updates on its progress.
grid_search_parameters:
xgbr:
learning_rate: [0.05]
min_split_loss: [0]
max_depth: [20]
min_child_weight: [0]
subsample: [1.0]
reg_lambda: [1.0]
reg_alpha: [0]
scale_pos_weight: [0.05, 0.1]
max_bin: [256]
booster: ["gbtree"]
max_delta_step: [0]
n_estimators: [200]
xgbc:
learning_rate: [0.1, 0.2]
min_split_loss: [0.01]
max_depth: [7]
min_child_weight: [1.0]
subsample: [0.9]
reg_lambda: [1.0]
reg_alpha: [0]
scale_pos_weight: [1.0]
max_bin: [256]
booster: ["gbtree"]
max_delta_step: [0]
n_estimators: [200]
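To make the cross-multiplication concrete, here is a minimal sketch of how the 'xgbr' grid could feed scikit-learn's GridSearchCV; 'X_train'/'y_train' are placeholders and this is not the exact code in 'grid_search.py'.

```python
# Minimal sketch, assuming scikit-learn and xgboost are installed;
# X_train and y_train are placeholders for the processed score data.
import yaml
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    param_grid = yaml.safe_load(f)["grid_search_parameters"]["xgbr"]

search = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror"),
    param_grid=param_grid,            # lists are cross-multiplied into a grid
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
    verbose=3,                        # progress updates for long searches
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```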
validation_parameters is a dictionary containing parameters which will be used for repeated K-Fold validation for model validation when you run 'python make_validation.py'.
validation_parameters:
n_splits: 5
n_repeats: 2
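A minimal sketch of how these two values could drive scikit-learn's RepeatedKFold (placeholder data; not the exact code in 'make_validation.py'):

```python
# Minimal sketch, assuming scikit-learn; X and y are placeholders for
# the processed features and labels.
import yaml
from sklearn.model_selection import RepeatedKFold, cross_val_score
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    vp = yaml.safe_load(f)["validation_parameters"]

cv = RepeatedKFold(n_splits=vp["n_splits"], n_repeats=vp["n_repeats"],
                   random_state=0)    # 5 folds x 2 repeats = 10 scores
# scores = cross_val_score(XGBRegressor(), X, y,
#                          scoring="neg_mean_squared_error", cv=cv)
```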
prediction_parameters is a dictionary containing parameters which will be used for prediction when you run 'python make_prediction.py'.
Note: If any parameter's value is unknown, input a best estimate of the population average.
All inputs are to be of type float or str (add .0 if the value is meant to be an integer).
prediction_parameters:
age: 16.0
student_id: C130
number_of_siblings: 2.0
n_male: 0.0
n_female: 3.0
hours_per_week: 10.0
attendance_rate: 10.0
sleep_time: '22:00'
wake_time: '6:00'
direct_admission: 'No'
CCA: Arts
learning_style: Visual
gender: Female
tuition: 'Yes'
mode_of_transport: "walk"
bag_color: rainbow
final_test: unknown
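A minimal sketch of how this block could be turned into a single-row input for prediction; the real script also applies the same encoding/imputation used in training, which is omitted here:

```python
# Minimal sketch, assuming pandas; preprocess() is a hypothetical
# stand-in for the pipeline's actual encoding/imputation steps.
import pandas as pd
import yaml

with open("src/config.yml", "r") as f:
    params = yaml.safe_load(f)["prediction_parameters"]

row = pd.DataFrame([params])   # one student, columns in config order
# row = preprocess(row)        # hypothetical: apply the training-time transforms
# score = model.predict(row)   # model: a fitted XGBRegressor/XGBClassifier
```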
Details on the XGBoost parameters are available here: XGBoost Documentation
Navigate to the base directory containing 'run.sh' in a terminal and key in 'bash run.sh'. The database will be downloaded, the packages required for running the scripts will be installed, and additional .csv files containing processed data will be generated to support the scripts. A series of short tests will also be run on the different Python functions that can be called through the terminal, introducing some of the utility of the pipeline while probing for errors.
Note: All Python scripts were written and tested using Python 3.8 in an Anaconda environment.
# Running the run.sh on terminal
# In the Terminal, navigate to the directory containing 'run.sh'
# 'chmod u+x run.sh' marks the shell script as executable and grants permission for 'run.sh' to install packages.
chmod u+x run.sh
# 'bash run.sh' executes the shell file. The bash file will begin importing libraries and some tests immediately.
bash run.sh
# If you encounter any ImportErrors (e.g. 'ImportError: DLL load failed while importing qhull: The specified module could not be found.'), uninstall the package from your environment and change the version of the package in 'requirements.txt' to the latest version available online before running 'bash run.sh' again.
Enter the 'src' directory through Git Bash to gain access to the following scripts (type -h for more info on the arguments). The random state of all functions has been kept constant; hence, for a specific train-test split, simply keep the ratio the same to keep the datasets the same. Each function will be explained in detail in the following subsections.
Specify the parameters for grid search cross validation to run a search on. All parameters can be found in 'config.yml' under 'grid_search_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (float will indicate proportion allocated to training set) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-filename | input the log filename e.g. 'xgbr1'; logged files contain the best parameters and corresponding score for each run and are stored in the 'logs' folder as '.yml' files |
-cv | input the number of folds of cross validation to use during optimization, default=5 |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python grid_search.py xgbr neg_mean_squared_error regression 0.8 -filename cheese3
Specify the parameters to run a K-Fold validation on a specified model. The configured model parameters can be found in 'config.yml' under 'modelname_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (float will indicate proportion allocated to training set) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-filename | input the log filename e.g. 'xgbr1'; logged files contain the best parameters and corresponding score for each run and are stored in the 'logs' folder as '.yml' files |
-cv | input the number of folds of cross validation to use during optimization, default=5 |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-use_configured | use configured model parameters ('y') or use default parameters ('n'), default='y' |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python make_validation.py xgbr neg_mean_squared_error regression 0.8 -use_configured n
Specify the parameters to run a prediction or test using a specified model. The input parameters for prediction can be found in 'config.yml' under 'prediction_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (should match all previous ratios used) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-testing | input your choice, testing or predicting (t/p), default='t' |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-use_configured | use configured model parameters ('y') or use default parameters ('n'), default='y' |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python make_prediction.py xgbr neg_mean_squared_error regression 0.9 -use_configured n -testing p
The full feature engineering details can be found in 'eda.ipynb'; this section should be a quick summary. The following engineered features were included:

Engineered Feature | Logic |
---|---|
sleep_hours based on sleep_time and wake_time (see the sketch below) | sleep = health = memory + attendance = good score |
privilege based on number of siblings and tuition | ((number_of_siblings = large) + (tuition = none)) = underprivileged |
female_class and male_class to identify single-sex classes | (good single-sex schools in Singapore = potential indicator of test scores)/(single-sex schools = more focussed students) |
class_size based on n_male and n_female | larger classes = less attention from teachers and less resources from school = less support from school |
Note that all decisions were supported by visualizations and statistics (in EDA) on top of the logic.
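A minimal sketch of the sleep_hours logic, assuming 'HH:MM' strings as in 'prediction_parameters' (the in-pipeline implementation may differ):

```python
# Minimal sketch, assuming pandas; computes hours slept, wrapping
# across midnight when wake_time is earlier than sleep_time.
import pandas as pd

def sleep_hours(sleep_time: str, wake_time: str) -> float:
    sleep = pd.to_datetime(sleep_time, format="%H:%M")
    wake = pd.to_datetime(wake_time, format="%H:%M")
    hours = (wake - sleep).total_seconds() / 3600
    return hours if hours > 0 else hours + 24  # wrap across midnight

print(sleep_hours("22:00", "6:00"))  # 8.0
```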
The following engineered features were considered but ultimately excluded:

Engineered Feature | Logic |
---|---|
gender_ratio in class based on n_male and n_female | (unbalanced class gender might affect attention)/(alternative logic: unbalanced class gender might affect resource distribution) |
n_male_cat and n_female_cat categories based on number of students of specific gender in class | capture class sizes categorically, but too closely related to n_male and n_female |
This dataset has moderate dimensionality, a small data size, data on different scales (which can be scaled if desired), many numpy.zeros, and around 5% missing values (in non-label features) which may or may not benefit from imputation in the context of this problem. Additionally, there is the requirement for both a regression model and a classification model.
XGBoost is a good candidate for dealing with the above conditions given that it is not sensitive to scale, is efficient at storing numpy.zeros, is able to ignore NaN data while still making use of other data within the row, and has both a regression model and a classification model ready for parameter tuning.
First off, this is a supervised learning problem: the goal is to use the given data and the corresponding labels to build a model capable of predicting outputs for new sets of data, which means unsupervised learning models do not need to be considered. Secondly, the moderate dimensionality and the many non-linear relationships between/within features are signs that linear models should not be considered. Single decision trees are good for modelling low-dimensionality, simple problems within a limited space, but quickly become ineffective once the problem has a moderate number of parameters. Lastly, predicting exam scores can be extremely complex, as exams are not environments of consistency. Even if a model has successfully identified that a student was supposed to perform well based on the features, it is possible that in the final exam the student fumbles due to stress, carelessness or an inability to focus. Hence, a decision tree ensemble or forest is required that is capable of modelling the complexity of the problem while maintaining a high degree of flexibility for regularization, so as not to overfit on features that have an inherently unpredictable output. This is where XGBoost comes in!
XGBoost (Extreme Gradient Boosting) uses an ensemble of gradient boosted trees to model the problem. XGBoost offers a good number of parameters to tune: the number of trees, the depth of the trees and the number of leaves each tree can have can be adjusted to suit the complexity of the model (more complex models require more trees and a larger depth to model appropriately, simply because there are more parameters to factor into each tree). More importantly for the context of this problem, the ways in which the model can be regularized are abundant: from the selection of random subsamples (rows) for the training of each tree, the selection of a fraction of columns for the training of each level of the tree, the minimum amount of gain (also known as gamma or loss reduction) required for a tree to create an additional branch, and the minimum total number of data instances in each leaf to be considered for branching, down to the lambda and alpha regularization terms which affect the gain calculation formula at each step of tree branching (thereby tuning how 'finely' each tree branches out) - the model has the right amount of regularization capabilities for this problem.
For the classification model, a label needed to be generated based on the data. Since education is about equalizing, but resources are limited and resource allocation is about optimization, the students were labelled based on scoring percentiles. Those scoring below a certain percentile are considered as 'requiring support/attention'. This makes more sense than setting a raw score threshold because resources should be allocated in proportion to neediness, yet the number of needy students a school can support is in reality limited. Hence schools should focus on those performing worst in their school rather than on everyone performing below some specific score (as the latter might diffuse attention and resources away from those who need them most).
For this project, the percentiles used to label the students (the label was named final_grade) are as follows:
final_grade | Percentile | Score Band |
---|---|---|
1 | 0.1 | [0 - 48] |
2 | 0.3 | (48 - 58) |
3 | 0.6 | [58 - 71) |
4 | 1.0 | [71 - 100] |
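A minimal sketch of how such percentile banding could be generated with pandas (toy scores; not the exact code used to build the label):

```python
# Minimal sketch, assuming a 'final_test' score column; the percentile
# thresholds follow the table above and labels 1-4 map to the bands.
import pandas as pd

scores = pd.Series([30, 50, 60, 75, 90, 45, 66, 82])  # toy final_test scores
q10, q30, q60 = scores.quantile([0.1, 0.3, 0.6])      # percentile thresholds

final_grade = pd.cut(
    scores,
    bins=[-float("inf"), q10, q30, q60, float("inf")],
    labels=[1, 2, 3, 4],
)
print(final_grade.value_counts().sort_index())
```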
There was insufficient time to run a thorough grid search cross validation for the models to tune the parameters. However, without tuning, the vanilla XGBRegressor() and XGBClassifier() models produced reasonable results during K-Fold Cross Validation and Testing.
Metric | XGBRegressor() | XGBClassifier() |
---|---|---|
Mean Squared Error (squared error between prediction and actual value) | 28 - 32 | N/A |
Accuracy (number of correctly predicted labels / number of tested labels) | N/A | 78 - 80% |
Based on experiments in the EDA, the sensitivity when classifying the final_grade=1 students was 70.7%, which is not good enough for deployment. The full confusion matrix is as seen below:
From the confusion matrix above, it is apparent that the model shows a bias towards classifying students into higher grades.
We can see that the model has difficulty classifying students into their exact final_grade category (e.g. for final_grade=1 students, only 104/147 (70.7%) were correctly categorized - it is not very sensitive to the performance of final_grade=1 students). Given that we know the number of students belonging to final_grade=1 from the percentiles set for the grade thresholds, there is sufficient data in terms of volume (relative to the dataset given) to characterize a final_grade=1 student. The poor sensitivity could simply mean it is harder to predict students who are going to perform poorly than those who will do well (final_grade=4 prediction has a sensitivity of 538/612 (87.9%)). Or it could mean that the quality of data collected for students belonging to final_grade=1 is poorer (e.g. false data being fed in on the number of hours studied per week, etc.). Among all the metrics with which to analyze the confusion matrix, sensitivity is the most relevant, as the school's top priority is to prevent students from falling through the cracks (in this case, being falsely classified as negative). To improve the sensitivity towards the characteristics of final_grade=1 students, it would be good to either increase the quantity of data from final_grade=1 students in the dataset or to improve the quality of data collected from these students. It would also be good to collect data specific to the identification of final_grade=1 students - perhaps something like 'detentions_received'. From a model-side perspective, identifying the features that help differentiate a final_grade=1 student from other students (if such a feature exists) and assigning them a larger weight would address this issue.
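A minimal sketch of the per-class sensitivity (recall) calculation described above, using toy labels in place of the real test set:

```python
# Minimal sketch, assuming scikit-learn; y_true/y_pred are toy stand-ins
# for the real test labels and model predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 2, 2, 3, 4, 4])
y_pred = np.array([1, 2, 1, 2, 2, 3, 4, 4])

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
# Rows are true grades; the diagonal holds correct classifications.
sensitivity = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN) per grade
print(dict(zip([1, 2, 3, 4], sensitivity.round(3))))
```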
If more time were available, experimenting with the exact percentiles that best split the thresholds, and the number of thresholds to create, would be useful as well - although this is likely to differ from dataset to dataset, as it will from school to school. For example, if the final_grade=1 percentile threshold is too high, the characteristics of students who truly need help will be mixed with those who are on the borderline, or perhaps even just average. For this dataset, the threshold corresponds to those who score 48 marks and below on the final exam, which is reasonable.
However, knowing that the model has a bias towards giving students a higher grade allows the user to take advantage of this fact in one simple way: take both final_grade=1 and final_grade=2 students as those who should be focused on. Based on the split above, doing so will capture 93.2% (137/147) of the students who require assistance (based on current percentile assumptions), which shows good potential for a model that has not been tuned.
For the context of this problem, the XGBRegressor() is slightly easier to deploy in terms of resource allocation. Given a set of data from a population of students, the school need only run a regression on all the students' data to rank the students individually across the entire dataset. The school can then select students individually for focussed attention. This model is much more useful for detecting outliers than for banding students for different treatment. For example, using the regressor, the school can pick the 'worst 3 scorers' in advance and focus attention on them, e.g. 1-1 consultations. On the reverse side, the school can also use this model to identify the 'top 3 performers' in advance, perhaps to select them for a competition or program.
The deployment of the regressor model can enable schools to make targeted efforts involving a few students at the score extremes after regression. Alternatively, if the same banding principle used for generating the classification model labels is applied to the scores predicted by regression, the school can convert the individual scores into different bands and provide attention and resources to the students accordingly.
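A minimal sketch of this ranking workflow, with toy predictions standing in for model.predict() output:

```python
# Minimal sketch; the toy array stands in for model.predict(X_students)
# on a fitted XGBRegressor.
import numpy as np

pred = np.array([72.1, 45.3, 88.0, 51.7, 93.4, 40.2])  # toy predicted scores

order = np.argsort(pred)        # ascending: worst predicted scorers first
worst_three = order[:3]         # candidates for focussed attention, e.g. 1-1s
top_three = order[-3:][::-1]    # candidates for competitions or programs
print(worst_three, top_three)   # [5 1 3] [4 2 0]
```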
Predicting scores can be extremely difficult as exams are not the best environment for consistency. Even if a model has successfully identified that a student should perform well, it is possible that in the final exam the student fumbles due to stress, carelessness or an inability to focus. If the model is expected to capture this information as well, it would be good to take multiple test scores and consolidate their average and variance as a proxy for performance consistency, which can then become a feature for training an improved model that factors in consistency.
Thank you for taking the time to read this document.
Pre-requisites for running the EDA Jupyter Notebook